167 research outputs found

    Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA

    To create a corpus exploration method whose topics are easier to interpret than those of standard LDA topic models, we propose combining two techniques: Entity Linking and Labeled LDA. Our method identifies in an ontology a series of descriptive labels for each document in a corpus, then generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability, and this may help quantitative evaluation. We illustrate the potential of the approach by applying it to identify the most relevant topics addressed by each party in the European Parliament's fifth mandate (1999-2004). Comment: in Proceedings of Digital Humanities 2016, Krakow

    Managing educational information on university websites: a proposal for Unibo.it

    This article focuses on the complexity of finding and analyzing the totality of educational information shared by the University of Bologna on its website during the last twenty years. It specifically emphasizes some issues related to the use of the Wayback Machine, the most important international web archive, and the need for a different research tool that would guarantee more solid analyses of the corpus. This tool could initially rely on standard Natural Language Processing techniques (such as tokenization, stop-word removal, parsing, etc.), but we also have to take into consideration more complex solutions, such as text mining analyses, WordNet integration and an ontological representation of knowledge. Thanks to approaches like the one presented here, future historians will be able to efficiently study the evolution of a university website.

    Event-based Access to Historical Italian War Memoirs

    The progressive digitization of historical archives provides new, often domain-specific, textual resources that report on facts and events which have happened in the past; among these, memoirs are a very common type of primary source. In this paper, we present an approach for extracting information from Italian historical war memoirs and turning it into structured knowledge. This is based on the semantic notions of events, participants and roles. We quantitatively evaluate each of the key steps of our approach and provide a graph-based representation of the extracted knowledge, which allows the reader to move between a Close and a Distant Reading of the collection. Comment: 23 pages, 6 figures

    A discipline-enriched dataset for tracking the computational turn of European universities

    In recent years, academic research appears to have been going through a methodological turning point. The discussion around the impact that computational methods will have on traditional fields of study has been the focus of many calls for papers and panels at established conferences. However, despite the high prevalence of this topic in the academic debate, it remains very challenging to assess whether academia as a whole has actually been adopting more digital resources and methods in recent years. We are currently studying this topic by combining hermeneutic and text mining practices while analyzing one of the primary research outputs of European universities, namely doctoral theses. In this work, we present an enriched dataset we created for addressing these research questions and the first results of the analyses we have conducted so far.

    Political Text Scaling Meets Computational Semantics

    During the last fifteen years, automatic text scaling has become one of the key tools of the Text as Data community in political science. Prominent text scaling algorithms, however, rely on the assumption that latent positions can be captured just by leveraging the information about word frequencies in documents under study. We challenge this traditional view and present a new, semantically aware text scaling algorithm, SemScale, which combines recent developments in the area of computational linguistics with unsupervised graph-based clustering. We conduct an extensive quantitative analysis over a collection of speeches from the European Parliament in five different languages and from two different legislative terms, and show that a scaling approach relying on semantic document representations is often better at capturing known underlying political dimensions than the established frequency-based (i.e., symbolic) scaling method. We further validate our findings through a series of experiments focused on text preprocessing and feature selection, document representation, scaling of party manifestos, and a supervised extension of our algorithm. To catalyze further research on this new branch of text scaling methods, we release a Python implementation of SemScale with all included data sets and evaluation procedures. Comment: Updated version - accepted for Transactions on Data Science (TDS)

    Collecting primary sources from web archives: A tale of scarcity and abundance

    The World Wide Web is the largest collection of human testimonies that we have ever had at our fingertips. Spanning from institutional websites to digital libraries, from personal blogs to Twitter accounts of prominent politicians, from online newspapers to large-scale knowledge bases, an immense number of born-digital testimonies is waiting to be retrieved, selected and studied by future historians. In addition to this, while these new resources are piling up steadily in front of our eyes, they are also rapidly replacing their analogue counterparts, from printed news articles to personal diaries, from letter correspondences to scientific publications. By acknowledging this sudden transition in production from printed to digital documents, the goal of this chapter is to present and discuss some of the new methodological issues that arise when these materials are to be employed as primary sources for studying the recent past.

    SLaTE: a system for labeling topics with entities


    L’archiviazione delle pagine dei quotidiani online [Archiving the pages of online newspapers]

    This essay focuses on the methods online newspapers adopt to guarantee the preservation and retrieval of their born-digital materials. First, it describes what kinds of documents are published on these websites. Then, after comparing the archives of the main digital newspapers and their search tools, it proposes two possible ways to improve them: the first aimed at improving querying by “descriptive metadata”, and the second focused on full-text querying using semantic search. It also stresses the importance of the “complete preservation” of these digital resources. In conclusion, it points out the importance of linking web history and studies on born-digital preservation.

    Cross-lingual classification of topics in political texts
